Graphical user interfaces (GUIs) are in great demand with the growing popularity of mobile apps. Automatic UI code generation from UI design drafts dramatically simplifies the development process. However, the nested layer structure in a design draft affects the quality and usability of the generated code. Few existing GUI automation techniques detect and group the nested layers to improve the accessibility of generated code. In this paper, we propose the UI Layers Group Detector, a vision-based method that automatically detects image layers (i.e., basic shapes and visual elements) and text layers that share the same semantic meaning. We propose two plug-in components, text fusion and box attention, that use text information from design drafts as prior information for group localization. We construct a large-scale UI dataset for training and testing, and present a data augmentation approach to boost detection performance. Experiments show that the proposed method achieves decent accuracy on layer grouping.
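Traditional 2D animation is labor-intensive, often requiring animators to manually draw twelve illustrations per second. While automatic frame interpolation may ease this burden, the artistic effects inherent to 2D animation make video synthesis particularly challenging compared to the photorealistic domain. Lower frame rates result in larger displacements and occlusions, discrete perceptual elements (e.g., lines and solid-color regions) pose difficulties for texture-oriented convolutional networks, and exaggerated nonlinear movements hinder training data collection. Previous work tried to address these issues, but used unscalable methods and focused on pixel-perfect performance. In contrast, we build a scalable system more appropriately centered on perceptual quality for this artistic domain. First, we propose a lightweight architecture with a simple yet effective occlusion technique that improves convergence on perceptual metrics with fewer trainable parameters. Second, we design a novel auxiliary module that leverages the Euclidean distance transform to improve the preservation of key line and region structures. Third, we automatically double the existing manually collected dataset for this task by quantifying movement nonlinearity, allowing us to improve model generalization. Finally, we establish through a user study that LPIPS and chamfer distance are preferable to PSNR and SSIM, validating our system's emphasis on perceptual quality in the 2D animation domain.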
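To make the chamfer distance mentioned above concrete, here is a minimal sketch in plain Python. It assumes line drawings are given as small binary grids and uses a symmetric, mean-normalized form of the distance; the function names, the brute-force nearest-neighbor search, and the normalization are illustrative assumptions, not the authors' implementation.

```python
import math


def line_pixels(image):
    """Return (row, col) coordinates of the nonzero pixels of a 2D binary grid."""
    return [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]


def directed_chamfer(src, dst):
    """Mean Euclidean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(image_a, image_b):
    """Symmetric chamfer distance between the line pixels of two binary grids.

    Assumption: both directions are averaged with equal weight.
    """
    a, b = line_pixels(image_a), line_pixels(image_b)
    if not a or not b:
        raise ValueError("both images need at least one line pixel")
    return 0.5 * (directed_chamfer(a, b) + directed_chamfer(b, a))
```

Unlike a per-pixel metric, this score grows smoothly with how far a line has been displaced, which is one plausible reason it tracks perceived quality of line art better than PSNR. For real frame sizes, a distance transform would replace the quadratic nearest-neighbor search.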
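Human pose information is a critical component in many downstream image processing tasks, such as activity recognition and motion tracking. Likewise, a pose estimator for the illustrated character domain would provide a valuable prior for assistive content creation tasks, such as reference pose retrieval and automatic character animation. But while modern data-driven techniques have substantially improved pose estimation performance on natural images, little work has been done for illustrations. In our work, we bridge this domain gap by efficiently transfer-learning from both domain-specific and task-specific source models. Additionally, we upgrade and expand an existing illustrated pose estimation dataset, and introduce two new datasets for classification and segmentation subtasks. We then apply the resulting state-of-the-art character pose estimator to solve the novel task of pose-guided illustration retrieval. All data, models, and code will be made publicly available.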
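Comprehensive 3D scene understanding, both geometric and semantic, is important for real-world applications such as robot perception. Most existing work has focused on developing data-driven discriminative models for scene understanding. This paper provides a new approach to scene understanding from a synthesis model perspective, by leveraging recent progress in implicit 3D representation and neural rendering. Building upon the great success of Neural Radiance Fields (NeRFs), we introduce Scene-Property Synthesis with NeRF (SS-NeRF), which is able not only to render photo-realistic RGB images from novel viewpoints, but also to render various accurate scene properties (e.g., appearance, geometry, and semantics). By doing so, we facilitate addressing a variety of scene understanding tasks under a unified framework, including semantic segmentation, surface normal estimation, reshading, keypoint detection, and edge detection. Our SS-NeRF framework can be a powerful tool for bridging generative learning and discriminative learning, and is thus beneficial to the investigation of a wide range of interesting problems, such as studying task relationships within a synthesis paradigm, transferring knowledge to novel tasks, facilitating downstream discriminative tasks as a form of data augmentation, and serving as an auto-labeller for data creation.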
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
The development of social media user stance detection and bot detection methods relies heavily on large-scale, high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, which holds back graph-based account detection research. To address these issues, we propose the Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB is built on the largest original dataset in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we use as user features the 20 user property features with the greatest information gain, together with user tweet features. In addition, we perform a thorough evaluation of MGTAB and other public datasets. Our experiments show that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
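The feature selection step described above (keeping the properties with the greatest information gain) can be illustrated with a short sketch. It assumes discrete feature values and uses the standard definition of information gain as the reduction in label entropy; the function names and the dictionary-of-columns layout are hypothetical and are not taken from the MGTAB code base.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Return the names of the k feature columns with the highest information gain.

    `columns` maps a (hypothetical) feature name to its list of discrete values.
    """
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]
```

Continuous properties such as follower counts would need to be binned before this estimate applies; how the benchmark handles that is not stated in the abstract.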
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose R2RISE, a model-agnostic explanation framework for IL models. R2RISE aims to explain the overall policy performance with respect to the frames in the demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the environment returns, the conventional evaluation outcome, as coefficients to build an importance map. We also conduct experiments to investigate three major questions: whether frames are equally important, how effective the importance map is, and how importance maps from different IL models relate to one another. The results show that R2RISE successfully distinguishes the important frames in the demonstrations.
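The masking-and-retraining loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the R2RISE implementation: `train_and_evaluate` is a hypothetical callback that retrains an IL model on the frames kept by a mask and returns the environment return, masks are sampled independently per frame, and each frame's score is normalized by how often it was kept.

```python
import random


def importance_map(num_frames, train_and_evaluate, num_masks=200, keep_prob=0.5, seed=0):
    """Estimate per-frame importance from returns obtained under random frame masks.

    train_and_evaluate(mask) -> float is a hypothetical, user-supplied function:
    it retrains the black-box model on frames where mask[i] is True and returns
    the resulting environment return.
    """
    rng = random.Random(seed)
    weighted = [0.0] * num_frames  # sum of returns over masks that kept frame i
    counts = [0] * num_frames      # number of masks that kept frame i
    for _ in range(num_masks):
        mask = [rng.random() < keep_prob for _ in range(num_frames)]
        ret = train_and_evaluate(mask)
        for i, kept in enumerate(mask):
            if kept:
                weighted[i] += ret
                counts[i] += 1
    # Average return observed when each frame was available for training.
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]


if __name__ == "__main__":
    # Toy stand-in for retraining: the policy only succeeds if frame 2 is kept.
    scores = importance_map(5, lambda mask: 1.0 if mask[2] else 0.0)
    print(scores)
```

In practice each call to the callback is a full training run, so the number of masks is the main cost driver of this kind of explanation.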
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted average of mean and percentile performances, and it covers the distributionally robust MDPs and the distributionally robust chance-constrained MDPs (both under reward ambiguity) as special cases. By considering that the unknown reward distribution lies in a Wasserstein ambiguity set, we derive a tractable reformulation of our model. In particular, we show that the return-risk model can also account for risk from an uncertain transition kernel when one only seeks deterministic policies, and that a distributionally robust MDP under the percentile criterion can be reformulated as its nominal counterpart at an adjusted risk level. A scalable first-order algorithm is designed to solve large-scale problems, and we demonstrate the advantages of our proposed model and algorithm through numerical experiments.
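As a rough illustration of a "weighted average of mean and percentile performances", the sketch below scores a set of sampled returns for one fixed policy. It is an empirical, non-robust version only: the weight name `lam`, the risk level name `epsilon`, and the lower-quantile convention are assumptions made for this example, and no Wasserstein ambiguity set is modeled.

```python
import math


def lower_quantile(samples, epsilon):
    """Smallest sample value v such that at least a fraction epsilon of samples are <= v."""
    if not samples or not 0.0 < epsilon <= 1.0:
        raise ValueError("need samples and a risk level in (0, 1]")
    ordered = sorted(samples)
    index = max(math.ceil(epsilon * len(ordered)) - 1, 0)
    return ordered[index]


def return_risk_objective(returns, lam, epsilon):
    """Weighted average of the mean return and the epsilon-percentile return.

    lam = 1 recovers the mean criterion; lam = 0 recovers the percentile criterion.
    """
    mean = sum(returns) / len(returns)
    return lam * mean + (1.0 - lam) * lower_quantile(returns, epsilon)
```

A policy with a few very large returns and many poor ones scores well on the mean but poorly on the percentile term, so the weight controls how much that tail risk is penalized.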
Witnessing the impressive achievements of pre-training techniques on large-scale data in computer vision and natural language processing, we ask whether this idea can be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains a large amount of information irrelevant to decision making, which makes predominant pre-training approaches from general vision less suitable for autonomous driving. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework designed for policy pre-training in visuomotor driving. We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. PPGeo is trained in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns a driving policy representation by predicting the future ego-motion and optimizing the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our approach, with improvements ranging from 2% to over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.